- Home
- Search Results
- Page 1 of 1
Search for: All records
-
Total Resources2
- Resource Type
-
0002000000000000
- More
- Availability
-
11
- Author / Contributor
- Filter by Author / Creator
-
-
Chen, Zhengyan (1)
-
Chu, Deming (1)
-
Guo, Zinan (1)
-
Rao, Sanjay (1)
-
Ribeiro, Bruno (1)
-
Sruthi, P C (1)
-
Sruthi, P. C. (1)
-
Zhang, Yongle (1)
-
#Tyler Phillips, Kenneth E. (0)
-
#Willis, Ciara (0)
-
& Abreu-Ramos, E. D. (0)
-
& Abramson, C. I. (0)
-
& Abreu-Ramos, E. D. (0)
-
& Adams, S.G. (0)
-
& Ahmed, K. (0)
-
& Ahmed, Khadija. (0)
-
& Aina, D.K. Jr. (0)
-
& Akcil-Okan, O. (0)
-
& Akuom, D. (0)
-
& Aleven, V. (0)
-
- Filter by Editor
-
-
& Spizer, S. M. (0)
-
& . Spizer, S. (0)
-
& Ahn, J. (0)
-
& Bateiha, S. (0)
-
& Bosch, N. (0)
-
& Brennan K. (0)
-
& Brennan, K. (0)
-
& Chen, B. (0)
-
& Chen, Bodong (0)
-
& Drown, S. (0)
-
& Ferretti, F. (0)
-
& Higgins, A. (0)
-
& J. Peters (0)
-
& Kali, Y. (0)
-
& Ruiz-Arias, P.M. (0)
-
& S. Spitzer (0)
-
& Sahin. I. (0)
-
& Spitzer, S. (0)
-
& Spitzer, S.M. (0)
-
(submitted - in Review for IEEE ICASSP-2024) (0)
-
-
Have feedback or suggestions for a way to improve these results?
!
Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
-
Debugging in production cloud systems (or live debugging) is a critical yet challenging task for on-call developers due to the financial impact of cloud service downtime and the inherent complexity of cloud systems. Unfortunately, how debugging is performed, and the unique challenges faced in the production cloud environment have not been investigated in detail. In this paper, we perform the first fine-grained, observational study of 93 real-world debugging experiences of production cloud failures in 15 widely adopted open-source distributed systems including distributed storage systems, databases, computing frameworks, message passing systems, and container orchestration systems. We examine each debugging experience with a fine-grained lens and categorize over 1700 debugging steps across all incidents. Our study provides a detailed picture of how developers perform various diagnosis activities including failure reproduction, anomaly analysis, program analysis, hypothesis formulation, information collection and online experiments. Highlights of our study include: (1) Analyses of the taxonomies and distributions of both live debugging activities and the underlying reasons for hypothesis forking, which confirm the presence of expert debugging strategies in production cloud systems, and offer insights to guide the training of novice developers and the development of tools that emulate expert behavior. (2) The identification of the primary challenge in anomaly detection (or, observability) for end-to-end debugging: the collection of system-specific data (17.1% of data collected). In comparison, nearly all (96%) invariants utilized to detect anomalies are already present in existing monitoring tools. (3) The identification of the importance of online interventions (i.e., in-production experiments that alter system execution) for live debugging - they are performed as frequently as information collection - with an investigation of different types of interventions and challenges. (4) An examination of novel debugging techniques developers utilized to overcome debugging challenges inherent to or amplified in cloud systems, which offer insights for the development of enhanced debugging tools.more » « lessFree, publicly-accessible full text available November 20, 2025
-
Sruthi, P. C.; Rao, Sanjay; Ribeiro, Bruno (, ACM Sigcomm Workshop on Network Meets AI & ML)
An official website of the United States government
